Mining Frequent Most Informative Subgraphs
Authors: Frédéric Pennerath, Amedeo Napoli
Abstract
The main practical problem encountered with frequent subgraph search methods is that they return tens of thousands of graph patterns, which makes visual analysis of the results impossible. To address this problem, this paper introduces a very restricted family of relevant graph patterns, called the most informative patterns, together with an algorithm to mine them and associated experimental results.
In graph-based data mining problems, the mined patterns are connected labelled graphs that are pairwise non-isomorphic. Several algorithms have been proposed [1–4] to mine frequent graph patterns in graph databases, by analogy with the frequent itemset search problem. Given a graph database, the frequency of a graph pattern is the number of graphs in the database that contain at least one subgraph isomorphic to the pattern. The frequent subgraph search problem consists in determining the set of frequent patterns, i.e. those whose frequency is at least a given minimum threshold, along with their frequencies.
The main practical problem with frequent patterns in general, and frequent subgraphs in particular, is their huge number, which makes it impossible for an expert to analyse every returned pattern visually. Increasing the minimum frequency threshold so as to mine only the few most frequent graphs does not help, since the most frequent graphs, the empty graph being the extreme example, are generally also the least interesting. A better solution consists in extracting a restricted but informative subset of the frequent patterns, such as the subset of frequent closed patterns [5]. However, even dense and relatively small graph data sets still contain several thousand frequent closed patterns [5]. Information theory gives a sharper definition of what informative means: informative patterns are simultaneously discriminating (i.e. they subsume a small subset of all possible object descriptions) and representative (i.e. they have a high frequency).
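As a rough illustration of the frequency definition given above, the following Python sketch counts how many graphs of a small database contain a given pattern. It is not the counting method of any of the cited algorithms: it simply tries every injective vertex mapping, and it assumes that "subgraph" means a not necessarily induced subgraph. The graph encoding, the function names (`make_graph`, `contains`, `frequency`) and the three molecule-like graphs are assumptions made for this example only.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """Build a labelled graph from {vertex: label} and (u, v, label) triples."""
    edge_labels = {frozenset((u, v)): label for u, v, label in edges}
    return vertex_labels, edge_labels


def contains(graph, pattern):
    """True if some (not necessarily induced) subgraph of `graph` is isomorphic to `pattern`."""
    g_vertices, g_edges = graph
    p_vertices, p_edges = pattern
    p_nodes = list(p_vertices)
    # Brute force: try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(g_vertices, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_vertices[v] != g_vertices[mapping[v]] for v in p_nodes):
            continue  # at least one vertex label does not match
        if all(
            g_edges.get(frozenset(mapping[v] for v in edge)) == label
            for edge, label in p_edges.items()
        ):
            return True  # every pattern edge maps to a graph edge with the same label
    return False


def frequency(database, pattern):
    """Number of database graphs that contain at least one occurrence of the pattern."""
    return sum(1 for graph in database if contains(graph, pattern))


# Hypothetical toy database of three small labelled graphs.
DATABASE = [
    make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2, "-"), (2, 3, "-")]),
    make_graph({1: "C", 2: "O"}, [(1, 2, "=")]),
    make_graph({1: "C", 2: "C"}, [(1, 2, "-")]),
]
EMPTY = make_graph({}, [])
C_C = make_graph({"a": "C", "b": "C"}, [("a", "b", "-")])
C_C_O = make_graph({"a": "C", "b": "C", "c": "O"}, [("a", "b", "-"), ("b", "c", "-")])

print(frequency(DATABASE, EMPTY), frequency(DATABASE, C_C), frequency(DATABASE, C_C_O))
```

On this toy data the empty pattern, C-C and C-C-O have frequencies 3, 2 and 1 respectively: the frequency can only decrease as the pattern grows, and the empty pattern is as frequent as it is uninformative.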
Since the frequency decreases as the pattern description grows, extracting the most informative patterns is necessarily an optimization problem: it selects the patterns that offer the best compromise between their frequency and the information contained in their description. In a very general setting, the quality of this compromise can be assessed by an arbitrary scoring function that associates a score with every pattern. The higher the score, the better the compromise. Given a graph database and a scoring function, the proposed method, implemented in the Forage software platform, searches the set of frequent graph patterns for the most informative patterns, defined as the local maxima of the scoring function in the pattern order.
Frédéric Pennerath, Amedeo Napoli
The Most Informative Pattern Extraction Problem
Frequent subgraph mining considers a set $G$ of pairwise non-isomorphic graph patterns whose vertices and edges are labelled in a set $L$. Because two isomorphic patterns of $G$ are necessarily equal, the subgraph isomorphism relation $\leq_G$ is an order relation over $G$. Given a database $D$ of graphs labelled in $L$, the frequency $\mathrm{freq}(p)$ of a pattern $p \in G$ is the number of elements of $D$ that contain at least one subgraph isomorphic to $p$. A pattern is frequent if its frequency is greater than or equal to a given threshold. A most informative pattern is then defined relative to a scoring order and an informative scoring function.
Definition. Given a database $D$ and a totally or partially ordered set $(S, \leq_S)$ called the scoring order, a scoring function is a function $s : G \times \mathbb{N} \to S$ mapping the pair $(p, f)$ of a pattern $p$ and its frequency $f$ in $D$ to the score $s(p, f)$ of $p$ relative to $D$. A scoring function $s$ is informative if every function $s_f : p \mapsto s(p, f)$ for any $f \in \mathbb{N}$ (resp. every function $s_p : f \mapsto s(p, f)$ for any $p \in G$) is an increasing function of $p$ (resp. of $f$).
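The definition above can be spelled out as two monotonicity conditions. The following LaTeX block is only one possible reading: it assumes that "increasing" means non-decreasing, which the text does not state explicitly. The last line restates, in the same notation, the remark that the frequency decreases when the pattern grows.

```latex
\[
  s \text{ is informative} \iff
  \begin{cases}
    \forall f \in \mathbb{N},\ \forall p, p' \in G:\quad
      p \leq_G p' \implies s(p, f) \leq_S s(p', f), \\[4pt]
    \forall p \in G,\ \forall f, f' \in \mathbb{N}:\quad
      f \leq f' \implies s(p, f) \leq_S s(p, f').
  \end{cases}
\]
\[
  \forall p, p' \in G:\quad p \leq_G p' \implies \mathrm{freq}(p) \geq \mathrm{freq}(p').
\]
```

Read together, these conditions show where the compromise comes from: moving to a larger pattern raises the score through its first argument but can lower it through the second, since the frequency may drop.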
The frequent most informative pattern extraction problem then consists in finding every frequent pattern whose score is a local maximum in $(G, \leq_G)$.
Definition. A pattern $p$ is a most informative pattern relative to a scoring order $(S, \leq_S)$ and an informative scoring function $s$ if and only if no immediate predecessor or immediate successor $p'$ of $p$ in $(G, \leq_G)$ satisfies $s(p', \mathrm{freq}(p')) >_S s(p, \mathrm{freq}(p))$. Here $p'$ is an immediate predecessor (resp. immediate successor) of $p$ if $p' < p$ and $p' \leq p'' < p \Rightarrow p'' = p'$ (resp. $p' > p$ and $p' \geq p'' > p \Rightarrow p'' = p'$).
Closed patterns appear as the most informative patterns relative to the scoring order $(G, \leq_G) \times (\mathbb{N}, \leq)$ and the informative scoring function $s_c : (p, f) \mapsto (p, f)$. Among all possible scoring functions, the one that defines closed patterns is in fact the most conservative filter. A more selective scoring function defines the so-called most synthetic patterns.
Definition. The most synthetic patterns are the most informative patterns relative to the scoring order of positive real numbers $(\mathbb{R}, \leq)$ and the informative scoring function $s_p : (p, f) \mapsto |p| \cdot f$, where $|p|$ denotes the size of pattern $p$ (for a graph pattern, the number of its vertices and edges).
The function $s_p$ is a rough estimator of the compression gain: it approximates the space saved when every occurrence of a pattern $p$ in $D$ is replaced by a single vertex identified by a special label. This information criterion follows the minimum description length principle found in Subdue [6]. However, Subdue is an incomplete beam search algorithm over graphs that uses the information criterion to converge towards some interesting patterns, whereas Forage is a complete graph mining algorithm that returns all frequent most informative patterns and only them. Test results have shown that frequent most synthetic patterns typically number a few tens where frequent closed patterns number several thousand.
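To make the two definitions concrete, here is a minimal Python sketch that filters the local maxima of a score over a tiny, hand-written order diagram. Everything in it is hypothetical: the eight patterns, their sizes and frequencies, the successor lists, and the helper names (`most_informative`, `product_greater`, and so on) are invented for illustration, and no frequency threshold is applied. The closed-pattern score is compared with the product order of the pattern order and the order on frequencies.

```python
from operator import gt

# Hypothetical toy data: pattern name -> (size, frequency),
# where size = number of vertices + number of edges.
PATTERNS = {
    "C": (1, 10), "O": (1, 7), "N": (1, 4),
    "C-C": (3, 9), "C-O": (3, 6), "C-N": (3, 4),
    "C-C-O": (5, 5), "C-C-C": (5, 2),
}
# Immediate successors in the pattern order (edges of the order diagram).
SUCCESSORS = {
    "C": ["C-C", "C-O", "C-N"], "O": ["C-O"], "N": ["C-N"],
    "C-C": ["C-C-O", "C-C-C"], "C-O": ["C-C-O"],
}


def neighbours(p):
    """Immediate predecessors and immediate successors of p."""
    predecessors = [q for q, out in SUCCESSORS.items() if p in out]
    return predecessors + SUCCESSORS.get(p, [])


def is_below(p, q):
    """True if p <= q in the pattern order (q reachable from p via successors)."""
    return p == q or any(is_below(r, q) for r in SUCCESSORS.get(p, []))


def most_informative(score, strictly_greater):
    """Patterns with no immediate neighbour of strictly greater score."""
    return [
        p for p in PATTERNS
        if not any(strictly_greater(score(q), score(p)) for q in neighbours(p))
    ]


def synthetic_score(p):
    size, freq = PATTERNS[p]
    return size * freq  # |p| * f


def closed_score(p):
    return (p, PATTERNS[p][1])  # (p, f), compared with the product order


def product_greater(a, b):
    """a > b in the product order: b's pattern is below a's and b's frequency is not larger."""
    (p, f), (q, g) = a, b
    return a != b and is_below(q, p) and g <= f


SYNTHETIC = most_informative(synthetic_score, gt)
CLOSED = most_informative(closed_score, product_greater)
print(SYNTHETIC)
print(CLOSED)
```

On this toy diagram the product-order score keeps 7 of the 8 patterns (only N, which has the same frequency as its successor C-N, is dropped), while the size-times-frequency score keeps only C-C and C-N.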
Note on the product order used in the scoring order of closed patterns: given two orders $(E_1, \leq_1)$ and $(E_2, \leq_2)$, the product order $(E_1 \times E_2, \leq_{12})$ is defined by $(x_1, x_2) \leq_{12} (y_1, y_2)$ if and only if $x_1 \leq_1 y_1$ and $x_2 \leq_2 y_2$.
A Most Informative Pattern Extraction Method
The Forage algorithm has been developed to extract, from a set of labelled graphs, the frequent most informative graph patterns relative to a parameterizable scoring function. It can therefore be used to extract both frequent closed subgraphs and frequent most synthetic subgraphs. Because every frequent pattern must have its score compared with those of all its immediate predecessors and successors, every immediate successor $p'$ of the currently mined pattern $p$ must be generated when $p$ is frequent. The canonical graph $c$ of $p'$ is then computed using an algorithm similar to Nauty [7] and is used as a key to retrieve the entry of pattern $p'$ from a trie. Each entry stores the pattern score and a boolean flag. This flag is initialized to true if and only if $p'$ is frequent, and is later set to false if any immediate predecessor or successor of $p'$ has a greater score than $p'$. If an entry for $p'$ already exists, a pattern isomorphic to $p'$ was previously generated and the algorithm therefore backtracks. Otherwise, the frequency of $p'$ is computed using embedding lists [4], the score of $p'$ is computed, and a new entry initialized with this score is added to the trie under the key $c$. If $p'$ is found to be frequent, the algorithm is called recursively to process all successors of $p'$, so that all frequent patterns are processed in depth-first order. In every case, the boolean flags of $p$ and $p'$ are updated by comparing their respective scores. At the end of the recursion, the patterns whose flag is true are extracted from the trie and returned as the frequent most informative patterns.
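The control flow described above can be sketched in a few lines of Python. This is not the Forage implementation: to stay short and runnable, the sketch swaps graph patterns for itemsets, so that a `frozenset` is its own canonical key (no Nauty-like canonical labelling is needed), frequencies are counted by subset tests instead of embedding lists, and a plain dictionary stands in for the trie. The function name `forage_sketch`, its parameters and the toy database are assumptions made for this example.

```python
def forage_sketch(database, min_freq, score):
    """Return the frequent patterns whose score is a local maximum (itemset stand-in)."""
    items = sorted(set().union(*database))
    entries = {}  # canonical key -> [score, flag]; a dict stands in for the trie

    def frequency(pattern):
        return sum(1 for transaction in database if pattern <= transaction)

    def visit(p):
        for item in items:
            if item in p:
                continue
            q = p | {item}  # immediate successor; the frozenset is its own canonical key
            is_new = q not in entries
            if is_new:
                f = frequency(q)
                # The flag starts as true if and only if q is frequent.
                entries[q] = [score(q, f), f >= min_freq]
            # In every case, compare the scores of p and q and update both flags.
            if entries[q][0] > entries[p][0]:
                entries[p][1] = False
            if entries[p][0] > entries[q][0]:
                entries[q][1] = False
            # Recurse only on newly generated frequent successors (depth-first).
            if is_new and f >= min_freq:
                visit(q)

    root = frozenset()
    f_root = frequency(root)
    entries[root] = [score(root, f_root), f_root >= min_freq]
    if f_root >= min_freq:
        visit(root)
    return {p for p, (_, flag) in entries.items() if flag}


# Hypothetical toy database of six transactions over the items a, b, c, d.
DATABASE = [frozenset(t) for t in ("abc", "abc", "ab", "ab", "ad", "d")]
# Size-times-frequency score, mirroring |p| * f with |p| = number of items.
RESULT = forage_sketch(DATABASE, 2, lambda p, f: len(p) * f)
print(sorted("".join(sorted(p)) for p in RESULT))
```

With a minimum frequency of 2, the sketch returns the two itemsets {a, b} and {d}. Note that infrequent successors are still scored and compared, since the local maximum is taken in the whole pattern order and not only among frequent patterns.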
Whereas algorithms such as gSpan [3] or Gaston [4] use efficient optimizations to reduce the redundant generation of distinct but isomorphic patterns, Forage has to generate the whole order diagram (or Hasse diagram) in order to filter the most informative patterns, without being able to apply any such optimization. This substantially increases the complexity of the algorithm and the required processing time; it is the price to pay for an effective pattern selection.
The algorithm has been tested on various data sets containing from a few hundred to a few thousand molecular graphs or reaction graphs derived from chemical reaction equations. One of these tests consisted, for instance, in mining frequent reaction patterns in a set of 200 reaction graphs. This set is composed of two subsets of 100 similar graphs representing two distinct, well-known families of reactions (the acetoacetic ester and Sonogashira synthesis methods). Gaston is used as a reference to mine the frequent reaction patterns, whereas Forage mines both the frequent closed patterns and the frequent most synthetic patterns. As expected, the processing times in Fig. 1 (b) show that Forage is on the order of a hundred times slower than Gaston at searching frequent patterns. However, the quality of the returned most synthetic patterns makes up for Forage's slowness: Fig. 1 (a) shows that the number of frequent most synthetic patterns grows slowly, ranging from 5 to 21, whereas at the same time the numbers of frequent patterns and frequent closed patterns grow exponentially, respectively from 2000 to 20000 and from 70 to 1000. A qualitative analysis shows that, among the 21 frequent most synthetic patterns found, 7 are characteristic of the two synthesis methods, 3 others represent the three resonance forms of aromatic cycles, and only the remaining 11 are apparently uninteresting reaction patterns. These figures should be compared with the hundreds of irrelevant frequent closed patterns.
Fig. 1.
Pattern distribution (a) and processing time (b)
In conclusion, the first experimental tests have highlighted the quality of the frequent most synthetic patterns, which are simultaneously very representative of the database content and very few in number: they typically number only a few tens where frequent closed patterns number in the thousands. In order to further validate the interest of such patterns and to better understand the computational limits, tests must now be extended to non-homogeneous graph databases (such as general chemical reaction databases with lower density), and the influence of various scoring functions on the quality of the patterns must be evaluated.